Matrix multiplication is one of the most fundamental operations in scientific computing. It represents the composition of linear maps, such as spatial transformations and rotations. The operation appears across many fields: encryption and decryption in cryptography, simulation of input-output models in mathematical modeling, and as a core computational kernel in advanced algorithms. Accelerating matrix multiplication is therefore a crucial problem.
The FIR lab introduced the design philosophy of hardware optimization and gave a first look at the priorities of hardware design. In this chapter we go a step further, showing how to design an efficient matrix multiplication accelerator by improving the computational structure, optimizing data access, and increasing parallelism.
The goal is to speed up the computation for matrices of size 128×128 or larger. We will compare against the matrix multiplication in the Python NumPy library, reducing the runtime from 0.0571 seconds in software to 0.0021 seconds with the block-matrix architecture, a speedup of roughly 27×.
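As a starting point, the software baseline can be measured with a short NumPy snippet like the one below. This is a sketch: the 128×128 size matches the lab target, but the measured time will vary by machine, so do not expect to reproduce the exact 0.0571 s figure.

```python
import time
import numpy as np

# Matrix size matching the lab target.
N = 128
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

# Time a single matrix multiplication; results depend on the host CPU.
start = time.perf_counter()
C = A @ B
elapsed = time.perf_counter() - start
print(f"NumPy {N}x{N} matmul took {elapsed:.4f} s")
```

In Part 1 of the lab this kind of measurement is run on the Processing System to establish the software reference time.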
| Part | Topic | Description | Environment |
|---|---|---|---|
| 1 | Software Implementation | Run a matrix multiplication in NumPy | Jupyter Notebook |
| | | Test the computation speed on the Processing System | |
| 2 | HLS Kernel Programming | Optimize data access with array partitioning | AMD Vitis HLS 2023.2 |
| | | Optimize on-chip memory utilization and latency with matrix blocking | |
| | | Optimize area efficiency with arbitrary precision | |
| | | Optimize latency with loop unrolling and pipelining | |
| 3 | System-level Integration | Create the overlay by integrating the IP with the Zynq processing system | Jupyter Notebook |
| | | Load the overlay and run the application on the PYNQ framework | |
| | | Visualize the results and analyze the performance | |
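The matrix-blocking technique listed in Part 2 can be previewed in plain Python before it is written as an HLS kernel. The sketch below (illustrative only; the lab's actual kernel is written in C++ with HLS pragmas) tiles the computation so that each step works on small blocks, the same idea that lets an FPGA keep tiles in fast on-chip BRAM instead of streaming the full matrices from external memory. The block size of 32 is an assumption, not a value prescribed by the lab.

```python
import numpy as np

def blocked_matmul(A, B, block=32):
    """Block (tiled) matrix multiplication.

    Each iteration multiplies one block x block tile of A by one tile of B
    and accumulates into the matching tile of C, so only small tiles need
    to be resident in fast memory at any time.
    """
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # Accumulate the partial product of one tile pair.
                C[i0:i0 + block, j0:j0 + block] += (
                    A[i0:i0 + block, k0:k0 + block]
                    @ B[k0:k0 + block, j0:j0 + block]
                )
    return C
```

On a CPU this tiling brings no speedup over NumPy's optimized `@`; its purpose here is to make the data-access pattern of the block-matrix architecture explicit before the HLS sections.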
Copyright © 2024 Advanced Micro Devices